GPU Acceleration of Small Dense Matrix Computation of the One-Sided Factorizations

Authors

  • Tingxing Dong
  • Mark Gates
  • Azzam Haidar
  • Piotr Luszczek
  • Stanimire Tomov
Abstract

Various scientific applications use Gaussian elimination, Cholesky, or QR factorization to solve dense linear systems. For an important class of problems, a relatively large number of small systems is generated and must be solved. Typically, the order of these linear systems is up to a few hundred, and their number ranges from a few thousand to millions. For example, subsurface transport simulations have a number of reaction systems to solve. Each system involves computing a Jacobian matrix and iteratively applying Gaussian elimination until an outer solver converges. The system size is typically around 100. As another example, consider an astrophysics ODE solver with Newton-Raphson iteration [1]. Multiple zones are simulated in one MPI task, and each zone corresponds to a small linear system, with each one resulting in multiple sequential solves [1]. A sparse direct solver called MA48 solves a sparse nonsymmetric system of m linear equations in n unknowns using Gaussian elimination; the typical matrix size is 150 by 150. If the matrix is symmetric and definite, the problem reduces to batched Cholesky factorization [2]. Other examples include hydrodynamic simulations, e.g., where the need is to compute thousands of matrix-matrix multiplies (dgemm) for dimensions well below 100 by 100 [3].

One-sided factorizations such as Cholesky, LU, and QR are based on block outer-product updates of the trailing matrix. Algorithmically, this corresponds to a sequence of two distinct phases: the panel factorization and the trailing matrix update. The panel factorization is latency- and memory-bound due to its predominant reliance on Level 2 BLAS operations. The implementation from the MAGMA library performs the panel factorization on the CPU and only uses the GPU to update the trailing matrix. A data transfer of the factorized panel from the CPU to the GPU is required at each step of the outermost loop.
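The two phases can be sketched with a minimal blocked right-looking Cholesky factorization in NumPy. This is an illustrative sketch of the panel/update structure, not MAGMA's implementation; the function name and the block size `nb` are assumptions for the example.

```python
import numpy as np

def blocked_cholesky(A, nb=32):
    """Right-looking blocked Cholesky: returns lower-triangular L with A = L @ L.T.

    Each outer step factors an nb-wide panel (memory-bound, Level-2-BLAS-like
    work) and then applies one large outer-product update to the trailing
    matrix (compute-bound, Level-3 BLAS work).
    """
    A = A.astype(np.float64).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # --- Panel factorization: unblocked Cholesky of the diagonal block
        A[k:k+b, k:k+b] = np.linalg.cholesky(A[k:k+b, k:k+b])
        if k + b < n:
            # L21 = A21 @ inv(L11).T  (triangular solve against the panel)
            A[k+b:, k:k+b] = np.linalg.solve(A[k:k+b, k:k+b],
                                             A[k+b:, k:k+b].T).T
            # --- Trailing matrix update: one big GEMM-like outer product
            L21 = A[k+b:, k:k+b]
            A[k+b:, k+b:] -= L21 @ L21.T
    return np.tril(A)
```

In the hybrid scheme described above, the panel step would run on the CPU while the trailing outer-product update (`L21 @ L21.T`) runs on the GPU, with the factorized panel transferred once per outer iteration.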
However, in the batched LU implementation we cannot afford such a memory transfer at any step, since the trailing matrix is small and the amount of computation is not sufficient to overlap it in time with the panel factorization. Many small data transfers will take away any performance advantage enjoyed by the GPU, especially because the data to transfer are not contiguous in memory but instead are stored with a stride called a leading dimension. Another challenge in achieving good performance is the pivoting, which is a source of thread divergence and noncoalesced memory accesses. This is the result of consecutive threads accessing the matrix elements with a stride of one column instead of a single-element stride when the matrix is stored in column-major format.
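The strided-access problem can be made concrete with a small address calculation. The sizes below are hypothetical; the point is the access pattern, not these particular numbers.

```python
# In column-major storage with leading dimension lda, element (i, j)
# of the matrix lives at linear offset i + j*lda in the buffer.
n, lda = 100, 128  # hypothetical small system with a padded leading dimension

def offset(i, j, lda=lda):
    return i + j * lda

# Walking down a column touches consecutive addresses (coalesced access):
col_offsets = [offset(i, 0) for i in range(4)]   # 0, 1, 2, 3
# Swapping two rows during pivoting touches one element per column,
# each lda elements apart (noncoalesced, strided access):
row_offsets = [offset(0, j) for j in range(4)]   # 0, 128, 256, 384
```

The same stride is why transferring a panel between CPU and GPU is expensive: each of its columns is contiguous, but consecutive columns sit `lda` elements apart, so a naive copy becomes many small strided transfers.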


Similar Articles

Batched matrix computations on hardware accelerators based on GPUs

Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations....


One-sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators

One-sided dense matrix factorizations are important computational kernels in many scientific and engineering simulations. In this paper, we propose two extensions of both right-looking (LU and QR) and left-looking (Cholesky) one-sided factorization algorithms to utilize the computing power of current heterogeneous architectures. We first describe a new class of non-GPU-resident algorithms that ...


One-sided dense matrix factorizations on a multicore with multiple GPU accelerators in MAGMA1

One-sided dense matrix factorizations are important computational kernels in many scientific and engineering simulations. In this paper, we propose two extensions of both right-looking (LU and QR) and left-looking (Cholesky) factorization algorithms to utilize the computing power of current heterogeneous architectures. We first describe a new class of non-GPU-resident algorithms that factorize ...


A CPU-GPU hybrid approach for the unsymmetric multifrontal method

Multifrontal is an efficient direct method for solving large-scale sparse and unsymmetric linear systems. The method transforms a large sparse matrix factorization process into a sequence of factorizations involving smaller dense frontal matrices. Some of these dense operations can be accelerated by using a graphics processing unit (GPU). We analyze the unsymmetric multifrontal method from both an ...


Evaluating one-sided programming models for GPU cluster computations

The Global Array toolkit (GA) [1] is a powerful framework for implementing algorithms with irregular communication patterns, such as those of quantum chemistry. On the other hand, accelerators such as GPUs have shown great potential for important kernels in quantum chemistry, for example, atomic integral generation [2] and dense linear algebra in correlated methods [3]. Integration of the globa...



Journal:

Volume   Issue

Pages  -

Publication date: 2014